UNIVERSITY OF CALGARY Feature selection for cancer classification using microarray gene expression data by Wenyan Zhong A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
Abstract
The rapid development of DNA microarray technology enables researchers to measure the expression levels of thousands of genes simultaneously and allows biologists to gain insight into the complex interactions in tumours at the gene expression level. Its application in cancer studies has shown great success in both diagnosis and elucidating pathological mechanisms. However, DNA microarray data usually contain thousands of genes, most of which prove to be uninformative or redundant. Meanwhile, the small sample sizes of microarray data undermine the diagnostic accuracy of statistical models. Selecting highly discriminative genes from raw gene expression data can therefore improve the performance of cancer classification and cut the cost of medical diagnosis. This M.Sc. thesis proposes and investigates a new method of selecting highly discriminative genes for cancer classification based on DNA microarray data. For the two-group classification problem, the Bhattacharyya distance is proposed to measure the dissimilarity in gene expression levels between the two groups. For each gene, we calculate the Bhattacharyya distance between the two groups based on the expression levels of that gene. We use the calculated distances, one for each gene, as a criterion to rank all the genes. Finally, a support vector machine is used to obtain the optimal subset of genes achieving the lowest misclassification rate. Compared with two other methods, SWKC (supervised weighted kernel clustering) (Shim et al., 2009) and SVM-RFE (support vector machine with recursive feature elimination) (Guyon et al., 2002), the proposed method is shown to be more effective and more sensitive to differentially expressed genes. In the simulation study, the proposed method has a much higher recovery rate than the other two methods.
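The ranking step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis's implementation: it assumes each gene's expression is normally distributed within each group, uses the standard closed form of the Bhattacharyya distance between two univariate normals, and assumes every group has a strictly positive sample variance. The function names (`bhattacharyya_normal`, `rank_genes`) and the toy data are hypothetical, and the final SVM step that picks the best subset of top-ranked genes is omitted.

```python
import math
from statistics import mean, variance


def bhattacharyya_normal(x, y):
    """Bhattacharyya distance between two univariate normals fitted to x and y.

    Assumes both samples have strictly positive variance.
    """
    m1, m2 = mean(x), mean(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def rank_genes(expr, labels):
    """Rank genes by decreasing Bhattacharyya distance between the two groups.

    expr:   list of samples, each a list of expression levels (one per gene).
    labels: list of 0/1 group labels, one per sample.
    Returns (gene indices sorted from most to least discriminative, distances).
    """
    n_genes = len(expr[0])
    dists = []
    for g in range(n_genes):
        group0 = [s[g] for s, lab in zip(expr, labels) if lab == 0]
        group1 = [s[g] for s, lab in zip(expr, labels) if lab == 1]
        dists.append(bhattacharyya_normal(group0, group1))
    order = sorted(range(n_genes), key=lambda g: dists[g], reverse=True)
    return order, dists


# Toy data (made up for illustration): 8 samples, 3 genes.
# Gene 0 differs strongly between groups, gene 1 not at all, gene 2 slightly.
expr = [
    [1.0, 2.0, 1.0], [1.2, 2.1, 1.5], [0.8, 1.9, 0.5], [1.0, 2.0, 1.0],  # group 0
    [3.0, 2.0, 1.2], [3.2, 2.1, 1.7], [2.8, 1.9, 0.7], [3.0, 2.0, 1.2],  # group 1
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

order, dists = rank_genes(expr, labels)
print(order)
```

In a full pipeline, a classifier would then be trained on the top-k genes for increasing k, keeping the k with the lowest estimated misclassification rate.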
Comparisons among these three gene selection methods are also made using two publicly available real DNA microarray datasets, the colon dataset and the leukemia dataset. Based on three classification performance indexes, i.e. the average number of genes selected, the average number of classification errors in the test set, and the misclassification rate, the proposed method achieves slightly better classification results than SVM-RFE on the colon dataset at a much lower computational cost. It also achieves better classification results than the SWKC method on both datasets. Finally, we discuss how future work could improve performance by introducing kernel density estimators and replacing the Bhattacharyya distance with the Hellinger distance as the feature selection criterion. Because kernel density estimation is free of distributional assumptions, the classification results would be more robust than those obtained with the Bhattacharyya distance under the normality assumption.
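For reference, the two distances mentioned here are linked through the Bhattacharyya coefficient. The block below states the standard textbook definitions and the closed form under normality; the abstract does not show the thesis's own notation, so the symbols ($p$, $q$, $\mu_i$, $\sigma_i^2$) are assumptions made for illustration.

```latex
% Bhattacharyya coefficient of two densities p and q
\mathrm{BC}(p, q) = \int \sqrt{p(x)\, q(x)}\; dx

% Bhattacharyya distance and (squared) Hellinger distance
D_B(p, q) = -\ln \mathrm{BC}(p, q), \qquad
H^2(p, q) = \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx
          = 1 - \mathrm{BC}(p, q)

% Hence H is a bounded, monotone increasing function of D_B
H(p, q) = \sqrt{1 - e^{-D_B(p, q)}}, \qquad 0 \le H \le 1

% Closed form when p = N(\mu_1, \sigma_1^2) and q = N(\mu_2, \sigma_2^2)
D_B = \frac{1}{4}\, \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}
    + \frac{1}{2} \ln\!\left( \frac{\sigma_1^2 + \sigma_2^2}{2\, \sigma_1 \sigma_2} \right)
```

With kernel density estimates in place of the normal densities, $\mathrm{BC}$ generally has no closed form and would have to be evaluated numerically.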
Similar resources
Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine
DNA microarray gene expression data give access to a wealth of information involving thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and differences between tumors. In this study we try to reduce the high-dimensional data by a statistical method to select valuable, high-impact genes as biomarkers and then classify ovarian tumors based on gene expression data of...
THE UNIVERSITY OF CALGARY Programming Distributed Collaboration Interaction Through the World Wide Web by Roberto Augusto Flores-Méndez A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
... Acknowledgments ... Dedications ... T...
Diagnosis of Breast Cancer Subtypes using the Selection of Effective Genes from Microarray Data
Introduction: Early diagnosis of breast cancer and the identification of effective genes are important issues in the treatment and survival of the patients. Gene expression data obtained using DNA microarray in combination with machine learning algorithms can provide new and intelligent methods for diagnosis of breast cancer. Methods: Data on the expression of 9216 genes from 84 patients across...
Gene Identification from Microarray Data for Diagnosis of Acute Myeloid and Lymphoblastic Leukemia Using a Sparse Gene Selection Method
Background: Microarray experiments can simultaneously determine the expression of thousands of genes. Identification of potential genes from microarray data for diagnosis of cancer is important. This study aimed to identify genes for the diagnosis of acute myeloid and lymphoblastic leukemia using a sparse feature selection method. Materials and Methods: In this descriptive study, the expressio...
Thesis Submitted in Partial Fulfillment of the requirement for the Degree of M.A/M. Sc In School consultant
Goal: The aim of this study is to assess and compare the emotional ability of deaf, semi-deaf, and hearing students (14-20) in Mashhad. Method: For this experiment, 105 students in total were selected randomly, 35 from each group: hearing boys and girls, deaf boys and girls, and semi-deaf boys and girls. This article is useful and explanatory. In this stud...
Publication date: 2014